104 Crawler: Using a Word Segmentation Tool to Count Word Frequencies in Job Descriptions

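2023 iThome Ironman Contest

Self-Challenge Group

Regularly pushing fuel price notifications to Line, with GitLab CI scheduling and Google Colab: Part 16 of the series

15th Ironman Contest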

tester0716

2023-10-18 23:20:39


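Today I will run word segmentation and related analysis on the job descriptions collected in the earlier posts. The planned workflow is:

1. Read the job description column from the Excel file and save it to job_descriptions.txt.
2. Count the word frequencies in that file.
3. Punctuation marks and very common words turn up in the counts, so:
* Remove the punctuation.
* Add the common words to the stop word list.
4. Segment the text with the Jieba package, using an external custom dictionary.
5. Save the segmentation result to segmented_job_descriptions.txt.
6. Draw a word cloud with the WordCloud package and identify the top three skill requirements.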
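Step 3 can be tried out before any real data is involved. The following is a minimal sketch using only the standard library; the helper names `is_noise` and `count_keywords`, the sample tokens, and the one-word stop word set are all made up for illustration and are not taken from the original project. It treats a token as noise when every character is whitespace, punctuation, or a symbol according to its Unicode category, which also covers full-width Chinese punctuation.

```python
import unicodedata
from collections import Counter


def is_noise(token: str) -> bool:
    """Return True when every character is whitespace, punctuation, or a symbol."""
    return all(
        ch.isspace() or unicodedata.category(ch)[0] in ("P", "S")
        for ch in token
    )


def count_keywords(tokens, stopwords):
    """Count tokens after dropping empty tokens, noise tokens, and stop words."""
    stripped = (t.strip() for t in tokens)
    return Counter(
        t for t in stripped
        if t and not is_noise(t) and t not in stopwords
    )


# Hypothetical hand-written sample; in practice the tokens would come from jieba.cut().
sample_tokens = ["測試", "，", "的", "Python", "測試", "。", "\n", "的", "Python", "測試"]
sample_stopwords = {"的"}  # assumed stop word list, for illustration only

top = count_keywords(sample_tokens, sample_stopwords).most_common(3)
print(top)
```

Filtering on Unicode categories avoids maintaining a hand-written list of punctuation marks. Tokens such as `C++` are kept because they contain at least one letter.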

Usage of the jieba package for word segmentation: https://pypi.org/project/jieba/

The approach I have in mind looks roughly like this:

import jieba
from collections import Counter

# 1. Read the contents of '工作內容彙總.txt' (the aggregated job descriptions)
with open('工作內容彙總.txt', 'r', encoding='utf-8') as file:
    content = file.read()

# 2. Segment the text with Jieba (jieba.cut returns a generator of tokens)
words = jieba.cut(content)

# 3. Count how often each token appears
word_count = Counter(words)

# 4. Display or save the word frequencies
# Print the 10 most frequent tokens
for word, count in word_count.most_common(10):
    print(f'{word}: {count}')

# To save the full frequency list to a file instead, use the following code
with open('詞頻結果.txt', 'w', encoding='utf-8') as output_file:
    for word, count in word_count.most_common():
        output_file.write(f'{word}: {count}\n')
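Steps 4 and 5 of the plan (a custom dictionary, then saving the segmented text) could be sketched as follows. This is only a sketch under assumptions: `segment_to_file` and its parameters are hypothetical names, and the `cut` parameter exists only so the helper can be tried with a stand-in tokenizer. When `cut` is omitted it falls back to `jieba.lcut`, after loading the custom dictionary with `jieba.load_userdict` if a path is given.

```python
from pathlib import Path


def segment_to_file(text, out_path, userdict_path=None, cut=None):
    """Segment text line by line and write space-separated tokens to out_path.

    cut is any callable mapping a string to a list of tokens. When it is
    omitted, jieba.lcut is used and userdict_path (if given) is loaded first.
    userdict_path is ignored when a custom cut is supplied.
    """
    if cut is None:
        import jieba  # imported lazily so the helper can be tried without jieba

        if userdict_path is not None:
            jieba.load_userdict(userdict_path)  # custom dictionary file
        cut = jieba.lcut

    segmented_lines = []
    for line in text.splitlines():
        # Drop whitespace-only tokens so the output has single-space separators.
        tokens = [tok for tok in cut(line) if tok.strip()]
        if tokens:
            segmented_lines.append(" ".join(tokens))

    Path(out_path).write_text("\n".join(segmented_lines) + "\n", encoding="utf-8")
    return segmented_lines
```

With the script above, a call might look like `segment_to_file(content, 'segmented_job_descriptions.txt', userdict_path='userdict.txt')`, where `userdict.txt` is a placeholder name for a custom dictionary file with one entry per line.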

The workflow above is loosely based on material shared by these authors: